{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 07 - Model Deployment\n",
    "\n",
    "by [Alejandro Correa Bahnsen](http://www.albahnsen.com/) & [Iván Torroledo](http://www.ivantorroledo.com/)\n",
    "\n",
    "version 1.2, Feb 2018\n",
    "\n",
    "## Part of the class [Machine Learning for Risk Management](https://github.com/albahnsen/ML_RiskManagement)\n",
    "\n",
    "This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Agenda:\n",
    "\n",
    "1. Creating and saving a model\n",
    "2. Running the model in batch\n",
    "3. Exposing the model as an API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Phishing Detection\n",
    "\n",
    "Phishing, by definition, is the act of defrauding an online user in order to obtain personal information by posing as a trustworthy institution or entity. Users usually have a hard time differentiating between legitimate and malicious sites because they are made to look exactly the same. Therefore, there is a need to create better tools to combat attackers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import zipfile\n",
    "with zipfile.ZipFile('../datasets/model_deployment/phishing.csv.zip', 'r') as z:\n",
    "    f = z.open('phishing.csv')\n",
    "    data = pd.read_csv(f, index_col=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>url</th>\n",
       "      <th>phishing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>http://www.subalipack.com/contact/images/sampl...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>http://fasc.maximecapellot-gypsyjazz-ensemble....</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>http://theotheragency.com/confirmer/confirmer-...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>http://aaalandscaping.com/components/com_smart...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>http://paypal.com.confirm-key-21107316126168.s...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 url  phishing\n",
       "0  http://www.subalipack.com/contact/images/sampl...         1\n",
       "1  http://fasc.maximecapellot-gypsyjazz-ensemble....         1\n",
       "2  http://theotheragency.com/confirmer/confirmer-...         1\n",
       "3  http://aaalandscaping.com/components/com_smart...         1\n",
       "4  http://paypal.com.confirm-key-21107316126168.s...         1"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>url</th>\n",
       "      <th>phishing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>39995</th>\n",
       "      <td>http://www.diaperswappers.com/forum/member.php...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39996</th>\n",
       "      <td>http://posting.bohemian.com/northbay/Tools/Ema...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39997</th>\n",
       "      <td>http://www.tripadvisor.jp/Hotel_Review-g303832...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39998</th>\n",
       "      <td>http://www.baylor.edu/content/services/downloa...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39999</th>\n",
       "      <td>http://www.phinfever.com/forums/viewtopic.php?...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                     url  phishing\n",
       "39995  http://www.diaperswappers.com/forum/member.php...         0\n",
       "39996  http://posting.bohemian.com/northbay/Tools/Ema...         0\n",
       "39997  http://www.tripadvisor.jp/Hotel_Review-g303832...         0\n",
       "39998  http://www.baylor.edu/content/services/downloa...         0\n",
       "39999  http://www.phinfever.com/forums/viewtopic.php?...         0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    20000\n",
       "0    20000\n",
       "Name: phishing, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.phishing.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['http://dothan.com.co/gold/austspark/index.htm\\n',\n",
       " 'http://78.142.63.63/%7Enetsysco/process/fc1d9c7ea4773b7ff90925c2902cb5f2\\n',\n",
       " 'http://verify95.5gbfree.com/coverme2010/\\n',\n",
       " 'http://www.racom.com/uploads/productscat/bookmark/ii.php?.rand=13vqcr8bp0gud&cbcxt=mai&email=abuse@tradinghouse.ca\\n',\n",
       " 'http://www.cleanenergytci.com/components/update.logon.l3an7lofamerica/2342343234532534546347677898765432876543345687656543876/\\n',\n",
       " 'http://209.148.89.163/-/santander.co.uk/weblegn/AccountLogin.php\\n',\n",
       " 'http://senevi.com/confirmation/\\n',\n",
       " 'http://www.hellenkeller.cl/tmp/new/noticias/Modulo_de_Atualizacao_Bradesco/index2.php?id=PSO1AM04L3Q6PSBNVJ82QUCO0L5GBSY2KM2U9BYUEO14HCRDVZEMTRB3DGJO9HPT4ROC4M8HA8LRJD5FCJ27AD0NTSC3A3VDUJQX6XFG519OED4RW6Y8J8VC19EAAAO5UF21CHGHIP7W4AO1GM8ZU4BUBQ6L2UQVARVM\\n',\n",
       " 'http://internet-sicherheit.co/de/konflikt/src%3Dde/AZ00276ZZ75/we%3Dhs_0_2/sicherheit/konto_verifizieren/verifizierung.php\\n',\n",
       " 'http://alen.co/docs/cleaner\\n',\n",
       " 'http://rattanhouse.co/Atualizacao_Bradesco/cadastro2013.php?2MAS2XACUJPI3U8D9ZDDG2G9YJICVABQ3K73KWDKYK0NA0AWWWCOUEDUJRXHRKPNMUYLDV89RA6OCG2MQUS0TAUXX9IOGJUEIXPDS5B0RM18OF1H860UAMJOY6ICUR81VSEKKJFPBYNLYGUXBGJ1HEHKOMLTM01P658M\\n',\n",
       " 'http://steamcommunily.co/p.php?login=true\\n',\n",
       " 'http://www.nyyg.com/Bradesco/5W9SQ394.html\\n',\n",
       " 'http://wp.tipografiacentral.com.co/sparkde/index.html\\n',\n",
       " 'http://www.entrerev.com/component/.secure.wpa/.www.paypal.com.returnUrl=/cgi-bin/5RF3S6y0K349/PayPal.co.uk/dispute_centre/sotmks/npsw&st.payment.decline.centre/ipoi/secure-codes.paypal.account4738154login.complete-infrmations.login.accountSecure26/securities/\\n',\n",
       " 'http://x.co/SecurCent\\n',\n",
       " 'http://dejatequerer.co/united.com/index.html\\n',\n",
       " 'http://www.speakeasymovies.com/components/com_wrapper/.amazon.co.uk/\\n',\n",
       " 'http://www.culturaespanola.com.br/bt/www.paypal.com/paypal.com.com/index-new.php\\n',\n",
       " 'http://www.agroassistance.com/components/com_content/c05354aa285b6a932a57086ba13762a1/\\n',\n",
       " 'http://www.estranetsrl.com.ar/bbvacambios.html\\n',\n",
       " 'http://osfsw.cba.pl/content/classic/html/ibpf/bradesco/?UOREEIYGQTERIRVSJTUHMVMZJWWYSVNYQOFSPWVFTEJEEKMJWHFERRYTFRWPSYYWGFIGJUPLZMZLTNSKOGMQQSHSXPLMXILVSM\\n',\n",
       " 'http://bitcrush.co/~geetha5/natwest/natwest/ibcarregister-natwst.html\\n',\n",
       " 'http://cannot-hide-from-PhishTank.zenith-services.com/controllare/auth/\\n',\n",
       " 'http://nova.pymesonline.co/fr.php\\n',\n",
       " 'http://comococino.com/wp-content/uploads/2013/01/paypal.com/us/cgi-bin/webscr.htm?\\n',\n",
       " 'http://www.fundacionchwinqlal.com.gt/imgs/Notas/img/_New/Agencias_Bradesco/Public_201133.php?KSR6YOU359CY1USIRMSBI8CFJF7TVREFJ6KIUFKZNXXNRP7JBYVU79APNGJI8YYR5I0YXUXLRU0JKF4WEYQL81BUGVDOTBFXUPVSKSEBNNU84X4IWT54UFYABCY5OE3J5XBOQQ1EDVMHTPZPJ4TEJSOU5NZS32B8ZNWQ\\n',\n",
       " 'http://flightripe.com/confirmation/update/billing/9a523c6017caa3406af9d5c2c0cb1854/\\n',\n",
       " 'http://accademiazerootto.it/templates/zerootto-new/html/com_content/category/bompreco.php\\n',\n",
       " 'http://santanderseguranca.zapto.org/Clientesx/\\n',\n",
       " 'http://www.muttico.com/components/com_media/p3rs0na4l/53f8b14c76c890e1806b8f9d97f12f80/\\n',\n",
       " 'http://us.fxlhtvf.ml/login/en/login.html.asp?refhttp:%2F%2Futddirect.com%2Fcomponents%2Fcom_content%2Fviews%2Fcategories%2Fmenu.html\\n',\n",
       " 'http://conferencistainternacional.com.co/urruirrhyttjk/Index.htm\\n',\n",
       " 'http://www.creativesovereign.com/components/com_newsfeeds/views/.../perfil/\\n',\n",
       " 'http://villamarina.com.co/administrator/servers/BankofAmerica/security-update/SecMeasure/account-overview.cgi/presentation/jskeys/sas/signonScreen.do/\\n',\n",
       " 'http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php\\n',\n",
       " 'http://www.enoxia.fr/components/com_content/tamfidelidade01.php\\n',\n",
       " 'http://gobbva.com/bb/empresa/index.php?tarjeta=\\n',\n",
       " 'http://paypal-com-confim.sharmikelectric.com/s4575234bf5055889415\\n',\n",
       " 'http://paypal.com.au.au.webapps.mpp.homes.konyadosemeciler.com/confirm/login.australia/au/webapps/mpp/home/initthi.php?cmd=SignIn&co_partnerId=2&pUserId=&siteid=0&pageType=&pa1=&i1=&bshowgif=&UsingSSL=&ru=&pp=&pa2=&errmsg=&runame=%5C%5C%5C%5C\\n',\n",
       " 'http://www.bbvabancocontinental.ya.st\\n',\n",
       " 'http://www.giannielectric.com/company/components/com_poll/assets/a/a5643cded2383f7568719482a943e1a5\\n',\n",
       " 'http://cooperativasanjose.com.co/plugins/josetta_ext/k2category/section/first.php\\n',\n",
       " 'http://appleid-apple-com-confirm-oyns-uattw6w61x3oka3pq.scientificcollectables.com/3c43e3d92e0b8a48f09f5fbb25d008a9/index1.php?cmd=https://connect.paypal.com/WebObjects/iTunesConnect.woa?login-processing=t&login_access=13409884065d3a174c294a9bf21bf71c23a3\\n',\n",
       " 'http://consultoriojuridico.co/pp/www.paypal.com/\\n',\n",
       " 'http://lovetodo.in.th/administrator/components/com_content/models/key/\\n',\n",
       " 'http://lnk.co/io6u45y45?erydh?mario.Carelli@poste.it\\n',\n",
       " 'http://www2.bancobbvacontnental.com/Centroll/informe/03/14/datitarlz/WUJFQ0VSUkFATVVOSVpMQVcuQ09N\\n',\n",
       " 'http://lfcintl.com/components/com_user/zzxc/bpd.com.do/app/do/personas/289302294350311363178310441412402464323394411438376403437407/banco.popular.php?Personal\\n',\n",
       " 'http://procuraduria.videoteca.com.co/update/apple.com/.cgi-bin/WebObjects/MyAppleIdwoa/wa/sign_in.html?appId=4129.returnURL=DaHR0cDovL3N0b3JlLmFwcGxlLmNvbS91c3wxYW9zZmU4OGZjNWIyNThhYWVhOTM5MzVjZjI2NTk1OGE3MWUwY2Y0MmI2OA%26r%3DSDHCD9JUYKX777H9KT\\n']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.url[data.phishing==1].sample(50, random_state=1).tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Contain any of the following:\n",
    "* https\n",
    "* login\n",
    "* .php\n",
    "* .html\n",
    "* @\n",
    "* sign\n",
    "* ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "keywords = ['https', 'login', '.php', '.html', '@', 'sign']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for keyword in keywords:\n",
    "    data['keyword_' + keyword] = data.url.str.contains(keyword).astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Lenght of the url\n",
    "* Lenght of domain\n",
    "* is IP?\n",
    "* Number of .com"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data['lenght'] = data.url.str.len() - 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "domain = data.url.str.split('/', expand=True).iloc[:, 2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data['lenght_domain'] = domain.str.len()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                                    www.subalipack.com\n",
       "1             fasc.maximecapellot-gypsyjazz-ensemble.nl\n",
       "2                                    theotheragency.com\n",
       "3                                    aaalandscaping.com\n",
       "4     paypal.com.confirm-key-21107316126168.securepp...\n",
       "5                              lcthomasdeiriarte.edu.co\n",
       "6                                       livetoshare.org\n",
       "7                                            www.i-m.co\n",
       "8                                     manuelfernando.co\n",
       "9                                www.bladesmithnews.com\n",
       "10                                      www.rasbaek.com\n",
       "11                                      199.231.190.160\n",
       "Name: 2, dtype: object"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "domain.head(12)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data['isIP'] = (domain.str.replace('.', '') * 1).str.isnumeric().astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data['count_com'] = data.url.str.count('com')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>url</th>\n",
       "      <th>phishing</th>\n",
       "      <th>keyword_https</th>\n",
       "      <th>keyword_login</th>\n",
       "      <th>keyword_.php</th>\n",
       "      <th>keyword_.html</th>\n",
       "      <th>keyword_@</th>\n",
       "      <th>keyword_sign</th>\n",
       "      <th>lenght</th>\n",
       "      <th>lenght_domain</th>\n",
       "      <th>isIP</th>\n",
       "      <th>count_com</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>28607</th>\n",
       "      <td>http://pennstatehershey.org/web/ibd/home/event...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>80</td>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3689</th>\n",
       "      <td>http://guiadesanborja.com/multiprinter/muestra...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>81</td>\n",
       "      <td>18</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6405</th>\n",
       "      <td>http://paranaibaweb.com/faleconosco/accounting...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>65</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35355</th>\n",
       "      <td>http://courts.delaware.gov/Jury%20Services/Hel...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>94</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16520</th>\n",
       "      <td>http://erpa.co/tmp/getproductrequest.htm\\n</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16196</th>\n",
       "      <td>http://pulapulapipoca.com/components/com_media...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>239</td>\n",
       "      <td>18</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3810</th>\n",
       "      <td>http://www.dag.or.kr/zboard/icon/visa/img/Atua...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>62</td>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3005</th>\n",
       "      <td>http://www.amazingdressup.com/wp-content/theme...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>94</td>\n",
       "      <td>22</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9003</th>\n",
       "      <td>http://web.indosuksesfutures.com/content_file/...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>80</td>\n",
       "      <td>25</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34704</th>\n",
       "      <td>http://www.nutritionaltree.com/subcat.aspx?cid...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>69</td>\n",
       "      <td>23</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12561</th>\n",
       "      <td>http://www.formation-continue-loiret.fr/compon...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>122</td>\n",
       "      <td>32</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10885</th>\n",
       "      <td>http://191.91.128.205/httpss/bancolombiaa.olb....</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>451</td>\n",
       "      <td>14</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2633</th>\n",
       "      <td>http://www.sternies-hp.de/components/com_conte...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>85</td>\n",
       "      <td>18</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22253</th>\n",
       "      <td>http://www.silive.com/northshore/index.ssf/200...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>85</td>\n",
       "      <td>14</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4720</th>\n",
       "      <td>http://www.dineo.co.za/components/com_content/...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>172</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                     url  phishing  \\\n",
       "28607  http://pennstatehershey.org/web/ibd/home/event...         0   \n",
       "3689   http://guiadesanborja.com/multiprinter/muestra...         1   \n",
       "6405   http://paranaibaweb.com/faleconosco/accounting...         1   \n",
       "35355  http://courts.delaware.gov/Jury%20Services/Hel...         0   \n",
       "16520         http://erpa.co/tmp/getproductrequest.htm\\n         1   \n",
       "16196  http://pulapulapipoca.com/components/com_media...         1   \n",
       "3810   http://www.dag.or.kr/zboard/icon/visa/img/Atua...         1   \n",
       "3005   http://www.amazingdressup.com/wp-content/theme...         1   \n",
       "9003   http://web.indosuksesfutures.com/content_file/...         1   \n",
       "34704  http://www.nutritionaltree.com/subcat.aspx?cid...         0   \n",
       "12561  http://www.formation-continue-loiret.fr/compon...         1   \n",
       "10885  http://191.91.128.205/httpss/bancolombiaa.olb....         1   \n",
       "2633   http://www.sternies-hp.de/components/com_conte...         1   \n",
       "22253  http://www.silive.com/northshore/index.ssf/200...         0   \n",
       "4720   http://www.dineo.co.za/components/com_content/...         1   \n",
       "\n",
       "       keyword_https  keyword_login  keyword_.php  keyword_.html  keyword_@  \\\n",
       "28607              0              0             0              0          0   \n",
       "3689               0              1             1              0          0   \n",
       "6405               0              0             0              1          0   \n",
       "35355              0              0             0              0          0   \n",
       "16520              0              0             0              0          0   \n",
       "16196              0              1             1              0          0   \n",
       "3810               0              0             0              0          0   \n",
       "3005               0              0             0              1          0   \n",
       "9003               0              0             0              0          0   \n",
       "34704              0              0             0              0          0   \n",
       "12561              0              0             0              0          0   \n",
       "10885              1              0             1              1          0   \n",
       "2633               0              0             0              0          0   \n",
       "22253              0              0             0              1          0   \n",
       "4720               0              0             1              0          0   \n",
       "\n",
       "       keyword_sign  lenght  lenght_domain  isIP  count_com  \n",
       "28607             0      80             20     0          0  \n",
       "3689              0      81             18     0          1  \n",
       "6405              0      65             16     0          1  \n",
       "35355             0      94             19     0          0  \n",
       "16520             0      39              7     0          0  \n",
       "16196             0     239             18     0          4  \n",
       "3810              0      62             13     0          0  \n",
       "3005              0      94             22     0          1  \n",
       "9003              0      80             25     0          1  \n",
       "34704             0      69             23     0          1  \n",
       "12561             0     122             32     0          5  \n",
       "10885             0     451             14     1          2  \n",
       "2633              0      85             18     0          2  \n",
       "22253             0      85             14     0          1  \n",
       "4720              0     172             15     0          3  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.sample(15, random_state=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = data.drop(['url', 'phishing'], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y = data.phishing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import cross_val_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "clf = RandomForestClassifier(n_jobs=-1, n_estimators=100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.80875, 0.80825, 0.804  , 0.79025, 0.80475, 0.81125, 0.80475,\n",
       "       0.80675, 0.8045 , 0.78925])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cross_val_score(clf, X, y, cv=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
       "            max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
       "            min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "            min_samples_leaf=1, min_samples_split=2,\n",
       "            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,\n",
       "            oob_score=False, random_state=None, verbose=0,\n",
       "            warm_start=False)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.externals import joblib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../datasets/model_deployment/07_phishing_clf.pkl']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joblib.dump(clf, '../datasets/model_deployment/07_phishing_clf.pkl', compress=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Model in batch\n",
    "\n",
    "See m07_model_deployment.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from m07_model_deployment import predict_proba"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6824999999999997"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predict_proba('http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 3: API\n",
    "\n",
    "Flask is considered more Pythonic than Django because Flask web application code is in most cases more explicit. Flask is easy to get started with as a beginner because there is little boilerplate code for getting a simple app up and running."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we need to install some libraries \n",
    "\n",
    "```\n",
    "pip install flask-restplus\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load Flask"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "from flask import Flask\n",
    "from flask_restplus import Api, Resource, fields\n",
    "from sklearn.externals import joblib\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create api"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "app = Flask(__name__)\n",
    "\n",
    "api = Api(\n",
    "    app, \n",
    "    version='1.0', \n",
    "    title='Phishing Prediction API',\n",
    "    description='Phishing Prediction API')\n",
    "\n",
    "ns = api.namespace('predict', \n",
    "     description='Phishing Classifier')\n",
    "   \n",
    "parser = api.parser()\n",
    "\n",
    "parser.add_argument(\n",
    "    'URL', \n",
    "    type=str, \n",
    "    required=True, \n",
    "    help='URL to be analyzed', \n",
    "    location='args')\n",
    "\n",
    "resource_fields = api.model('Resource', {\n",
    "    'result': fields.String,\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load model and create function that predicts an URL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = joblib.load('../datasets/model_deployment/07_phishing_clf.pkl') \n",
    "\n",
    "@ns.route('/')\n",
    "class PhishingApi(Resource):\n",
    "\n",
    "    @api.doc(parser=parser)\n",
    "    @api.marshal_with(resource_fields)\n",
    "    def get(self):\n",
    "        args = parser.parse_args()\n",
    "        result = self.predict_proba(args)\n",
    "\n",
    "        return result, 200\n",
    "\n",
    "    def predict_proba(self, args):\n",
    "        url = args['URL']\n",
    "        \n",
    "        url_ = pd.DataFrame([url], columns=['url'])\n",
    "        \n",
    "        # Create features\n",
    "        keywords = ['https', 'login', '.php', '.html', '@', 'sign']\n",
    "        for keyword in keywords:\n",
    "            url_['keyword_' + keyword] = url_.url.str.contains(keyword).astype(int)\n",
    "        \n",
    "        url_['lenght'] = url_.url.str.len() - 2\n",
    "        domain = url_.url.str.split('/', expand=True).iloc[:, 2]\n",
    "        url_['lenght_domain'] = domain.str.len()\n",
    "        url_['isIP'] = (url_.url.str.replace('.', '') * 1).str.isnumeric().astype(int)\n",
    "        url_['count_com'] = url_.url.str.count('com')\n",
    "\n",
    "        # Make prediction\n",
    "        p1 = clf.predict_proba(url_.drop('url', axis=1))[0,1]\n",
    "\n",
    "        print('url=', url,'| p1=', p1)\n",
    "\n",
    "        return {\n",
    "         \"result\": p1\n",
    "        }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      " * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)\n",
      "127.0.0.1 - - [07/Feb/2018 15:56:34] \"GET /predict/?URL=http://consultoriojuridico.co/pp/www.paypal.com/ HTTP/1.1\" 200 -\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "url= http://consultoriojuridico.co/pp/www.paypal.com/ | p1= 0.32507780944545644\n"
     ]
    }
   ],
   "source": [
    "app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check using \n",
    "\n",
    "* http://localhost:5000/predict/?URL=http://consultoriojuridico.co/pp/www.paypal.com/\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
